Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint connector bugfixes #10647

Merged
merged 6 commits into from
Oct 5, 2024
Merged

Conversation

jstjohn
Copy link
Collaborator

@jstjohn jstjohn commented Sep 26, 2024

What does this PR do ?

Get checkpoint connector working for bionemo (see NVIDIA/bionemo-framework#180)

Changelog

  • Use new nemo2 standard /weights and /context subdirectory scheme so checkpoint loaders work properly with checkpoints created by this method.
  • dedup ckpt_to_dir
  • Other changes necessary to allow checkpoint transformation outside of training resumption.

@jstjohn jstjohn marked this pull request as draft September 26, 2024 23:00
@jstjohn jstjohn self-assigned this Sep 26, 2024
@jstjohn jstjohn marked this pull request as ready for review September 26, 2024 23:31
@jstjohn jstjohn force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch 4 times, most recently from 6069101 to a4ad2d7 Compare September 27, 2024 22:42
nemo/lightning/io/connector.py Fixed Show resolved Hide resolved
@akoumpa akoumpa added the r2.0.0 label Sep 30, 2024
Copy link
Contributor

[🤖]: Hi @jstjohn 👋,

I just wanted to let you know that, you know, a CICD pipeline for this PR just finished successfully ✨

So it might be time to merge this PR or like to get some approvals 🚀

But I'm just a 🤖 so I'll leave it you what to do next.

Have a great day!

//cc @ko3n1g

@jstjohn jstjohn force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch from a4ad2d7 to 04a9660 Compare September 30, 2024 19:56
@akoumpa akoumpa added Run CICD and removed Run CICD labels Sep 30, 2024
akoumpa
akoumpa previously approved these changes Sep 30, 2024
Copy link
Member

@akoumpa akoumpa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot.

akoumpa
akoumpa previously approved these changes Sep 30, 2024
@akoumpa akoumpa added Run CICD and removed Run CICD labels Sep 30, 2024
@akoumpa
Copy link
Member

akoumpa commented Sep 30, 2024

CI seems to have an issue with CI assets; will retry it later.

@akoumpa akoumpa added Run CICD and removed Run CICD labels Oct 1, 2024
@jstjohn jstjohn force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch from acd3b68 to ffa03ab Compare October 1, 2024 15:57
@akoumpa akoumpa added Run CICD and removed Run CICD labels Oct 4, 2024
@akoumpa akoumpa force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch from 32365ea to 499db1c Compare October 4, 2024 20:58
…_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch from 499db1c to 41b9a65 Compare October 4, 2024 20:58
@akoumpa akoumpa added Run CICD and removed Run CICD labels Oct 4, 2024
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
@akoumpa akoumpa added Run CICD and removed Run CICD labels Oct 4, 2024
@akoumpa akoumpa enabled auto-merge (squash) October 4, 2024 22:15
akoumpa
akoumpa previously approved these changes Oct 4, 2024
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Copy link
Contributor

github-actions bot commented Oct 5, 2024

[🤖]: Hi @jstjohn 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it you what to do next.

//cc @pablo-garay @ko3n1g

@akoumpa akoumpa merged commit 4034266 into main Oct 5, 2024
153 of 161 checks passed
@akoumpa akoumpa deleted the jstjohn/nemo_checkpoint_connector_fixes branch October 5, 2024 04:28
ko3n1g pushed a commit that referenced this pull request Oct 5, 2024
* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
@akoumpa
Copy link
Member

akoumpa commented Oct 7, 2024

This needs #10786

monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
yashaswikarnati pushed a commit that referenced this pull request Oct 20, 2024
* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Oct 21, 2024
* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
pablo-garay pushed a commit that referenced this pull request Oct 22, 2024
…0.0` (#10770)

* Checkpoint connector bugfixes (#10647)

* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ko3n1g <ko3n1g@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* py syntax fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: ko3n1g <ko3n1g@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: John St. John <jstjohn@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: ko3n1g <ko3n1g@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
* Update checkpoint connector nemo_save to match current folder heirarchy

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address PR feedback

Signed-off-by: John St John <jstjohn@nvidia.com>

* Address divergent code issue in ckpt_to_dir

Signed-off-by: John St John <jstjohn@nvidia.com>

* revert changes to nemo.utils.model_utils and use model.lightning.ckpt_utils instead (to avoid side-effects to nemo1.0 ckpts)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants